Skip to content

Add MIG validations including RHEL setup#21

Merged
bogdando merged 2 commits into
rhos-vaf:mainfrom
rlandy:mig-remove-conditions
Jun 15, 2026
Merged

Add MIG validations including RHEL setup#21
bogdando merged 2 commits into
rhos-vaf:mainfrom
rlandy:mig-remove-conditions

Conversation

@rlandy

@rlandy rlandy commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

Add MIG validations and remove GPU-specific conditionals

This PR adds validations for MIG configurations:

  • Add flavor configuration for MIG testing
  • Update MIG checks and NVIDIA validations to work on all GPU types
  • Add RHEL guest repository support
  • Add DKMS installation and reboot handling
  • Install NVIDIA repos with custom RPM
  • Make CUDA repo available for installation
  • Install nvidia-container-toolkit and configure CDI
  • Setup CDI and NVIDIA Management Library
  • Add GPU utilization option in vLLM service
  • Move CUDA install tasks to separate file
  • Remove A30-specific conditionals to support all GPU types with updated RHEL 9.6 drivers
  • Fix reboot sequencing for proper driver and CDI initialization

Co-Authored-By: Claude Sonnet 4.5
Signed-off-by: Ronelle Landy rlandy@redhat.com

@coderabbitai

coderabbitai Bot commented Jun 9, 2026

Copy link
Copy Markdown

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds gpu_validation_mode and related defaults; installs/configures NVIDIA CUDA repo and container tooling on RHEL for MIG; adds MIG detection and assertions; prepares RHEL kernel tooling for MIG; and wires mode-specific flavor extra_specs and an optional vllm service flag.

Changes

GPU Validation Mode Configuration and MIG Support

Layer / File(s) Summary
Configuration defaults and dependencies
gpu-validation/defaults/main.yaml, requirements.yaml
Establishes gpu_validation_mode default, gpu_validation_pci_alias, gpu_validation_extra_specs, gpu_validation_libnvidia_ml_version, and adds a ci-framework collection requirement.
NVIDIA CUDA repository and container runtime setup
gpu-validation/tasks/main.yaml, gpu-validation/tasks/nvidia-cuda-repos.yaml
Imports CUDA repo tasks on RedHat when mode is mig, creates NVIDIA CUDA YUM repo with GPG, installs nvidia-container-toolkit, conditionally configures CDI via nvidia-ctk, reboots VM, installs libnvidia-ml, and refreshes RPM facts.
MIG mode detection tasks
gpu-validation/tasks/nvidia.yaml
Adds tasks that run nvidia-smi (per-GPU UUID, PCI ID, MIG mode) and lsmod to capture NVIDIA module state; both register outputs and run only when gpu_validation_mode == "mig".
MIG validation assertions
gpu-validation/tasks/nvidia_assertions.yaml
Adds `when: found_nvidia
RHEL system preparation for MIG mode
gpu-validation/tasks/setup.yaml
Adds conditional RHEL/MIG setup block to run cifmw.general.repo_setup (rhos_release), add EPEL, and install DKMS and kernel-devel/kernel-headers via dnf4.
Flavor specs and service template configuration
gpu-validation/tasks/vm_image.yaml, gpu-validation/templates/vllm-serve.service.j2
Sets gpu_validation_extra_specs to resources:VGPU: "1" for MIG mode, updates flavor creation to use gpu_validation_extra_specs, and conditionally includes --gpu-memory-utilization in the vllm service ExecStart when set.

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 inconclusive)

Check name Status Explanation Resolution
Description check ❓ Inconclusive No pull request description was provided by the author, making it impossible to assess whether the description relates to the changeset. Add a detailed pull request description explaining the changes, the rationale behind adding MIG mode support, and any configuration or testing considerations.
✅ Passed checks (4 passed)
Check name Status Explanation
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Title check ✅ Passed The title 'Add MIG validations including RHEL setup' directly and clearly summarizes the main changes: it adds MIG (Multi-Instance GPU) validation logic and RHEL-specific setup steps throughout the changeset.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🧹 Nitpick comments (1)
gpu-validation/tasks/gpus.yaml (1)

12-15: ⚡ Quick win

Make A30 GPU detection more specific.

The string search for "A30" could match unintended substrings in lspci output (e.g., "GA30", "A300", or other device descriptions). Consider combining the NVIDIA and A30 checks or using a more specific pattern.

♻️ Recommended fix to improve detection accuracy
 - name: Set found_A30 to true if A30 is found
   ansible.builtin.set_fact:
     found_a30: true
-  when: lspci_output.stdout is search("A30")
+  when: lspci_output.stdout is search("NVIDIA.*A30")
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@gpu-validation/tasks/gpus.yaml` around lines 12 - 15, The current GPU
detection sets found_a30 based on a loose substring match of lspci_output.stdout
for "A30"; tighten it by changing the conditional on the
ansible.builtin.set_fact that sets found_a30 to use a more specific regex that
ensures both the vendor and model are present (e.g., require "NVIDIA" near
"A30") or use word boundaries and case-insensitivity to avoid partial matches;
update the when clause referencing lspci_output.stdout and the found_a30 task to
use that stricter search (e.g., a regex like a vendor+model match or \bA30\b
with (?i) for case-insensitive matching).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@gpu-validation/defaults/main.yaml`:
- Line 28: The pci passthrough alias is hardcoded to "gpu-l4:{{
gpu_validation_flavor_gpus }}" which assumes a preconfigured Nova PCI alias;
make the alias configurable and provide guidance/tests: introduce a new variable
(e.g., gpu_validation_pci_alias) and replace the hardcoded "gpu-l4" in the
"pci_passthrough:alias" value with that variable (falling back to a sensible
default or empty to disable the extra-spec), update references to the template
key "pci_passthrough:alias" and the variable "gpu_validation_flavor_gpus", and
add a short note or unit test that validates the variable is set or documents
required Nova pci alias mappings.

In `@gpu-validation/tasks/nvidia-cuda-repos.yaml`:
- Around line 18-21: The task "Reboot the VM to find the installed drivers" in
gpu-validation/tasks/nvidia-cuda-repos.yaml is issuing an unconditional
ansible.builtin.reboot after installing nvidia-container-toolkit; remove or make
that reboot conditional so it doesn't force downtime. Replace the unconditional
reboot with either removal of the reboot task or wrap it with the same check
used by reboot_if_needed.yaml (the needs-restarting -r check / driver-reboot
condition), so the reboot runs only when needs-restarting indicates a
kernel/driver restart is required.
- Around line 40-41: The task that runs "nvidia-ctk runtime configure
--runtime=containerd" can fail on hosts without containerd or
/etc/containerd/config.toml; update gpu-validation/tasks/nvidia-cuda-repos.yaml
to either (a) ensure containerd is installed and configured before this task by
adding the containerd install/configure steps, or (b) add a guard that checks
for containerd presence (e.g., ansible.builtin.stat on /usr/bin/containerd or
/etc/containerd/config.toml and/or ansible.builtin.service status for
containerd) and only runs the nvidia-ctk command when that check succeeds
(otherwise skip or fail with a clear message).

In `@requirements.yaml`:
- Line 8: The git dependency
"git+https://github.com/openstack-k8s-operators/ci-framework.git" must be pinned
to a specific tag or commit to ensure reproducible installs; update that entry
by appending a tag or commit reference (e.g., use
"git+https://github.com/openstack-k8s-operators/ci-framework.git@v1.0.0" or a
commit SHA) so the requirements.yaml points to an immutable release instead of
the floating HEAD.

---

Nitpick comments:
In `@gpu-validation/tasks/gpus.yaml`:
- Around line 12-15: The current GPU detection sets found_a30 based on a loose
substring match of lspci_output.stdout for "A30"; tighten it by changing the
conditional on the ansible.builtin.set_fact that sets found_a30 to use a more
specific regex that ensures both the vendor and model are present (e.g., require
"NVIDIA" near "A30") or use word boundaries and case-insensitivity to avoid
partial matches; update the when clause referencing lspci_output.stdout and the
found_a30 task to use that stricter search (e.g., a regex like a vendor+model
match or \bA30\b with (?i) for case-insensitive matching).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: cafbb32f-07d3-429e-bb08-8523477d29a3

📥 Commits

Reviewing files that changed from the base of the PR and between 20704bc and d6bdb74.

📒 Files selected for processing (11)
  • gpu-validation/defaults/main.yaml
  • gpu-validation/tasks/gpus.yaml
  • gpu-validation/tasks/main.yaml
  • gpu-validation/tasks/nvidia-cuda-repos.yaml
  • gpu-validation/tasks/nvidia.yaml
  • gpu-validation/tasks/nvidia_assertions.yaml
  • gpu-validation/tasks/setup.yaml
  • gpu-validation/tasks/vm_image.yaml
  • gpu-validation/templates/vllm-serve.service.j2
  • main.yaml
  • requirements.yaml

Comment thread gpu-validation/defaults/main.yaml Outdated
Comment thread gpu-validation/tasks/nvidia-cuda-repos.yaml
Comment on lines +40 to +41
- name: Configure NVIDIA container runtime
ansible.builtin.command: nvidia-ctk runtime configure --runtime=containerd

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Search for containerd installation tasks
echo "=== Checking for containerd installation ==="
rg -n "name:.*containerd" --type yaml -C 2

# Check if containerd is a dependency or installed via dnf/yum
rg -n "containerd" --type yaml -C 3

# Check nvidia-ctk documentation or usage for runtime requirements
rg -n "nvidia-ctk.*runtime.*configure" --type yaml -C 3

Repository: rhos-vaf/gpu-validation

Length of output: 1212


Add a containerd dependency/presence check before configuring NVIDIA container runtime

gpu-validation/tasks/nvidia-cuda-repos.yaml runs nvidia-ctk runtime configure --runtime=containerd, but the repository’s YAML does not show any containerd install/presence step—only this containerd-related usage was found—so the command can fail on hosts without containerd and/or /etc/containerd/config.toml.

Ensure containerd is installed (and the expected config exists) before this task, or add a guard (e.g., stat/service check) to fail clearly or skip.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@gpu-validation/tasks/nvidia-cuda-repos.yaml` around lines 40 - 41, The task
that runs "nvidia-ctk runtime configure --runtime=containerd" can fail on hosts
without containerd or /etc/containerd/config.toml; update
gpu-validation/tasks/nvidia-cuda-repos.yaml to either (a) ensure containerd is
installed and configured before this task by adding the containerd
install/configure steps, or (b) add a guard that checks for containerd presence
(e.g., ansible.builtin.stat on /usr/bin/containerd or
/etc/containerd/config.toml and/or ansible.builtin.service status for
containerd) and only runs the nvidia-ctk command when that check succeeds
(otherwise skip or fail with a clear message).

Comment thread requirements.yaml Outdated
- name: community.general
version: ">=10.0.0"
- name: git+https://github.com/openstack-k8s-operators/edpm-ansible.git
- name: git+https://github.com/openstack-k8s-operators/ci-framework.git

@coderabbitai coderabbitai Bot Jun 9, 2026

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

Pin ci-framework collection to a specific version.

The git dependency lacks version pinning (tag, branch, or commit SHA), which can cause non-reproducible builds and unexpected breakage if the upstream repository changes.

📌 Recommended fix to pin the dependency
-  - name: git+https://github.com/openstack-k8s-operators/ci-framework.git
+  - name: git+https://github.com/openstack-k8s-operators/ci-framework.git,main

Or better yet, pin to a specific tag or commit:

  - name: git+https://github.com/openstack-k8s-operators/ci-framework.git,v1.0.0
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
- name: git+https://github.com/openstack-k8s-operators/ci-framework.git
- name: git+https://github.com/openstack-k8s-operators/ci-framework.git,main
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@requirements.yaml` at line 8, The git dependency
"git+https://github.com/openstack-k8s-operators/ci-framework.git" must be pinned
to a specific tag or commit to ensure reproducible installs; update that entry
by appending a tag or commit reference (e.g., use
"git+https://github.com/openstack-k8s-operators/ci-framework.git@v1.0.0" or a
commit SHA) so the requirements.yaml points to an immutable release instead of
the floating HEAD.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

edpm-ansible is not pinned - following that convention.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rlandy, fair point — edpm-ansible being unpinned establishes the convention for this repo, and ci-framework is consistent with that. I'll withdraw the suggestion.

(∪ ・ ω ・)∪

Comment thread requirements.yaml Outdated
- name: community.general
version: ">=10.0.0"
- name: git+https://github.com/openstack-k8s-operators/edpm-ansible.git
- name: git+https://github.com/openstack-k8s-operators/ci-framework.git

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we want ci-framework as a dependency?

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yea this isnt needed.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is - as that is the role we are using to install RHEL repos

Comment thread main.yaml Outdated
hosts: target_nodes
gather_subset:
- min
gather_facts: true

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: was it missing some facts outside the min scope?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reverting back to min

Comment thread gpu-validation/tasks/gpus.yaml Outdated
found_nvidia: true
when: lspci_output.stdout is search("NVIDIA")

- name: Set found_A30 to true if A30 is found

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please drop as no longer used

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@rlandy rlandy mentioned this pull request Jun 9, 2026
path: /etc/cdi/nvidia.yaml
register: nvidia_driver_cdi_config_file

- name: Configure NVIDIA container runtime

@MiguelCarpio MiguelCarpio Jun 9, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These (Check if CDI configfile exists and Configure NVIDIA container runtime) can be part of a configuration task after Install NVIDIA CUDA repos and Reboot if system updates require it tasks

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Starting by leaving the reboot where it is - let's see if the tasks can get by with it there.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reboot was required - conditional reboot task did not work here - added reboot back.

Comment thread gpu-validation/tasks/setup.yaml Outdated
gpgcheck: 1
gpgkey: "https://dl.fedoraproject.org/pub/epel/RPM-GPG-KEY-EPEL-{{ ansible_facts['distribution_major_version'] }}"

- name: Install DKMS (Not supported)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit, let's leave the label (UNSUPPORTED) as standard

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@rlandy rlandy left a comment

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Responding to comments

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
gpu-validation/tasks/setup.yaml (2)

28-35: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Pin kernel-devel and kernel-headers to the running kernel version.

Installing kernel-devel and kernel-headers without version constraints will fetch the latest available versions, which may not match the currently running kernel (ansible_kernel). DKMS module compilation requires an exact version match between the running kernel and the installed headers. A mismatch will cause the NVIDIA driver installation (via osp.edpm.edpm_accel_drivers) to fail, breaking the entire MIG validation flow.

This scenario is realistic: a system installed with an older kernel that hasn't rebooted after newer kernel packages became available in the repositories.

🔧 Proposed fix to pin kernel packages to running kernel version
     - name: Install DKMS (UNSUPPORTED)
       ansible.builtin.dnf:
         use_backend: dnf4
         name:
           - dkms
-          - kernel-devel
-          - kernel-headers
+          - "kernel-devel-{{ ansible_kernel }}"
+          - "kernel-headers-{{ ansible_kernel }}"
         state: present
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@gpu-validation/tasks/setup.yaml` around lines 28 - 35, The task "Install DKMS
(UNSUPPORTED)" installs kernel-devel and kernel-headers unpinned, which can pull
headers for a different kernel than the running one; change the package names to
pin to the current running kernel using the ansible fact (ansible_kernel) so you
install kernel-devel-{{ ansible_kernel }} and kernel-headers-{{ ansible_kernel
}} (leave dkms unpinned), ensuring the playbook gathers facts before this task
and that the package list in the ansible.builtin.dnf task uses those suffixed
names so DKMS can compile against the exact running kernel.

14-17: ⚠️ Potential issue | 🟠 Major

Fix kernel-mismatch risk and validate the external role task file

  • gpu-validation/tasks/setup.yaml installs kernel-devel/kernel-headers without pinning to the running kernel; DKMS builds typically require an exact kernel-devel match (add version pinning to ansible_kernel and/or ensure a reboot-before-DKMS flow).
  • gpu-validation/tasks/setup.yaml includes cifmw.general.repo_setup with tasks_from: rhos_release; local code doesn’t contain this role/tasks, and in ci-framework the rhos_release.yml usage appears associated with the reproducer role—confirm that cifmw.general.repo_setup actually contains a rhos_release (or rhos_release.yml) task file so the include doesn’t fail at runtime.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@gpu-validation/tasks/setup.yaml` around lines 14 - 17, In
gpu-validation/tasks/setup.yaml the play installs kernel-devel/kernel-headers
without pinning to the running kernel (risking DKMS build failures) and
includes_role name cifmw.general.repo_setup with tasks_from: rhos_release which
may not exist; update the package installation step to install kernel-devel-{{
ansible_kernel }} and kernel-headers-{{ ansible_kernel }} (or otherwise ensure a
reboot-before-DKMS flow) and verify/adjust the include_role usage: confirm that
the cifmw.general.repo_setup role actually contains a rhos_release (or
rhos_release.yml) task file and change tasks_from to the correct task name or
remove the tasks_from so the role loads its default tasks to avoid runtime
failures.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@gpu-validation/tasks/setup.yaml`:
- Around line 28-35: The task "Install DKMS (UNSUPPORTED)" installs kernel-devel
and kernel-headers unpinned, which can pull headers for a different kernel than
the running one; change the package names to pin to the current running kernel
using the ansible fact (ansible_kernel) so you install kernel-devel-{{
ansible_kernel }} and kernel-headers-{{ ansible_kernel }} (leave dkms unpinned),
ensuring the playbook gathers facts before this task and that the package list
in the ansible.builtin.dnf task uses those suffixed names so DKMS can compile
against the exact running kernel.
- Around line 14-17: In gpu-validation/tasks/setup.yaml the play installs
kernel-devel/kernel-headers without pinning to the running kernel (risking DKMS
build failures) and includes_role name cifmw.general.repo_setup with tasks_from:
rhos_release which may not exist; update the package installation step to
install kernel-devel-{{ ansible_kernel }} and kernel-headers-{{ ansible_kernel
}} (or otherwise ensure a reboot-before-DKMS flow) and verify/adjust the
include_role usage: confirm that the cifmw.general.repo_setup role actually
contains a rhos_release (or rhos_release.yml) task file and change tasks_from to
the correct task name or remove the tasks_from so the role loads its default
tasks to avoid runtime failures.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: e2df970f-d421-407d-ba63-fd58726c3169

📥 Commits

Reviewing files that changed from the base of the PR and between d6bdb74 and 02a8b86.

📒 Files selected for processing (3)
  • gpu-validation/defaults/main.yaml
  • gpu-validation/tasks/nvidia-cuda-repos.yaml
  • gpu-validation/tasks/setup.yaml
💤 Files with no reviewable changes (1)
  • gpu-validation/tasks/nvidia-cuda-repos.yaml
🚧 Files skipped from review as they are similar to previous changes (1)
  • gpu-validation/defaults/main.yaml

Comment thread gpu-validation/tasks/setup.yaml Outdated
block:
- name: Run repo-setup for RHEL VMs
ansible.builtin.include_role:
name: cifmw.general.repo_setup

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please do not make this depending on downstream cifmw.general.repo_setup 's rhos_release tool. Instead, provide a boostrap command variable which CI jobs can overload with rhos_release and repo enablement commands

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain more here? What would this look like?

@rlandy rlandy force-pushed the mig-remove-conditions branch from e05f415 to 952d2fe Compare June 12, 2026 05:21
This PR adds validations for MIG configurations:

- Add flavor configuration for MIG testing
- Update MIG checks and NVIDIA validations to work on all GPU types
- Add RHEL guest repository support
- Add DKMS installation and reboot handling
- Install NVIDIA repos with custom RPM
- Make CUDA repo available for installation
- Install nvidia-container-toolkit and configure CDI
- Setup CDI and NVIDIA Management Library
- Add GPU utilization option in vLLM service
- Move CUDA install tasks to separate file
- Remove A30-specific conditionals to support all GPU types with updated RHEL 9.6 drivers
- Fix reboot sequencing for proper driver and CDI initialization

Co-Authored-By: Claude Sonnet 4.5
Signed-off-by: Ronelle Landy <rlandy@redhat.com>
@rlandy rlandy changed the title Mig remove conditions Add MIG validations including RHEL setup Jun 12, 2026
@rlandy rlandy force-pushed the mig-remove-conditions branch from 952d2fe to 4c52584 Compare June 12, 2026 10:39

@bogdando bogdando left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@bogdando bogdando merged commit 565225d into rhos-vaf:main Jun 15, 2026
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants